Do not repeat my commentary in your assignments. Any written work outside of the code chunks should be your own. You wouldn't copy and paste from another article without proper citation, so do not plagiarize the lab material either. I will take points off for plagiarism.
library(tidyverse)
library(stargazer)
library(knitr)
execs <- read.csv("data/lab04_execs_data.csv")
Use results='asis' in your code chunk options, i.e. {r, results='asis'}, and set header=FALSE in your stargazer call to omit the package citation. To knit to PDF, change type='html' to type='latex'.
stargazer(execs, header=FALSE, type='html',
font.size = "footnotesize",
digits = 2,
title = "Descriptive Statistics of Study Variables" )
| Statistic | N | Mean | St. Dev. | Min | Max |
|---|---|---|---|---|---|
| id | 1,000 | 3,052.00 | 803.00 | 1,702 | 4,476 |
| year | 1,000 | 2,009.00 | 0.00 | 2,009 | 2,009 |
| salary | 1,000 | 749.00 | 440.00 | 0.00 | 5,600.00 |
| bonus | 1,000 | 243.00 | 1,068.00 | 0.00 | 19,891.00 |
| total_compensation | 1,000 | 992.00 | 1,190.00 | 0.00 | 21,033.00 |
| age | 1,000 | 56.00 | 7.30 | 37 | 89 |
Note that several of the variables are missing from the table because they are not numeric.
typeof(execs$gender)
## [1] "integer"
typeof(execs$ethnicity)
## [1] "integer"
typeof(execs$ceo)
## [1] "integer"
Run the regressions properly for your independent variables. If a variable is categorical or ordered, such as gender, race, or education, model it as a series of dummy variables with a reference group. Do not model these types of covariates as continuous.
The exception is ordered variables with a large number of categories. If education has 20 possibilities, such as 1 year of education, 2 years of education, \(\ldots\), then it is generally okay to treat that variable as continuous. If education is “Less than HS,” “HS/GED,” “Some College,” and “College Degree,” you need to model it as a categorical variable with a reference group.
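To see what this dummy coding looks like concretely, here is a minimal sketch with a hypothetical four-category education factor (the educ variable below is invented for illustration). model.matrix() displays the design matrix lm() would build, with the first level absorbed into the intercept as the reference group:
# hypothetical ordered education variable
educ <- factor(c("Less than HS", "HS/GED", "Some College", "College Degree"),
               levels = c("Less than HS", "HS/GED", "Some College", "College Degree"))
# one dummy column per non-reference level; "Less than HS" is the reference
model.matrix(~ educ)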
Suppose the Data Were Not Already Encoded as Factors
execs_numeric <- execs %>%
mutate(gender = as.numeric(gender),
ethnicity = as.numeric(ethnicity),
ceo = as.numeric(ceo))
#Salary by Gender
lm_1 <- lm(salary ~ gender, data=execs_numeric)
#Salary by CEO Status
lm_2 <- lm(salary ~ ceo, data=execs_numeric)
#Salary by Race/Ethnicity
lm_3 <- lm(salary ~ ethnicity, data=execs_numeric)
#Data Table
stargazer(lm_1, lm_2, lm_3, header=FALSE,
type = 'html',
font.size = "footnotesize",
digits = 2,
title = "Incorrect Linear Models of Executive Compensation",
star.cutoffs = c(0.05, 0.01, 0.001))
| Dependent variable: salary | (1) | (2) | (3) |
|---|---|---|---|
| gender | 135.00* | | |
| | (58.00) | | |
| ceo | | -254.00*** | |
| | | (28.00) | |
| ethnicity | | | -95.00*** |
| | | | (13.00) |
| Constant | 486.00*** | 1,094.00*** | 1,096.00*** |
| | (114.00) | (40.00) | (50.00) |
| Observations | 1,000 | 1,000 | 1,000 |
| R2 | 0.01 | 0.08 | 0.05 |
| Adjusted R2 | 0.004 | 0.08 | 0.05 |
| Residual Std. Error (df = 998) | 439.00 | 423.00 | 429.00 |
| F Statistic (df = 1; 998) | 5.40* | 83.00*** | 51.00*** |
| Note: | \*p<0.05; \*\*p<0.01; \*\*\*p<0.001 | | |
The only thing that makes this approach incorrect is how the variables are encoded: gender, ceo, and ethnicity are treated as numeric. We know this because the data type for each is currently numeric, when it should be factor.
Contrast these models with the corrected version with factors:
#Salary by Gender
lm_1 <- lm(salary ~ gender, data=execs)
#Salary by CEO Status
lm_2 <- lm(salary ~ ceo, data=execs)
#Salary by Race/Ethnicity
lm_3 <- lm(salary ~ ethnicity, data=execs)
#Data Table
stargazer(lm_1, lm_2, lm_3, header=FALSE,
type = 'html',
font.size = "footnotesize",
digits = 2,
title = "Correct Linear Models of Executive Compensation",
star.cutoffs = c(0.05, 0.01, 0.001))
| Dependent variable: salary | (1) | (2) | (3) |
|---|---|---|---|
| genderMALE | 135.00* | | |
| | (58.00) | | |
| ceoOther Executive | | -254.00*** | |
| | | (28.00) | |
| ethnicityASIAN | | | -133.00 |
| | | | (151.00) |
| ethnicityCAUCASIAN | | | -73.00 |
| | | | (116.00) |
| ethnicityHISPANIC | | | -371.00 |
| | | | (444.00) |
| ethnicityUNKNOWN | | | -282.00* |
| | | | (117.00) |
| Constant | 621.00*** | 839.00*** | 896.00*** |
| | (57.00) | (17.00) | (115.00) |
| Observations | 1,000 | 1,000 | 1,000 |
| R2 | 0.01 | 0.08 | 0.05 |
| Adjusted R2 | 0.004 | 0.08 | 0.05 |
| Residual Std. Error | 439.00 (df = 998) | 423.00 (df = 998) | 429.00 (df = 995) |
| F Statistic | 5.40* (df = 1; 998) | 83.00*** (df = 1; 998) | 14.00*** (df = 4; 995) |
| Note: | \*p<0.05; \*\*p<0.01; \*\*\*p<0.001 | | |
Now we will run a basic multiple regression model using OLS. The model will use several independent variables and covariates.
# Salary by Age, Gender, Ethnicity, and CEO
lm_1 <- lm(salary ~ age + gender + ethnicity + ceo, data=execs)
#Data Table
stargazer(lm_1, header=FALSE,
type = 'html',
font.size = "footnotesize",
digits = 2,
title = "Multiple Regression of Executive Salary",
star.cutoffs = c(0.05, 0.01, 0.001))
| Dependent variable: salary | |
|---|---|
| age | 4.00* |
| | (1.80) |
| genderMALE | 84.00 |
| | (56.00) |
| ethnicityASIAN | -150.00 |
| | (146.00) |
| ethnicityCAUCASIAN | -114.00 |
| | (112.00) |
| ethnicityHISPANIC | -276.00 |
| | (429.00) |
| ethnicityUNKNOWN | -289.00* |
| | (113.00) |
| ceoOther Executive | -227.00*** |
| | (28.00) |
| Constant | 701.00*** |
| | (153.00) |
| Observations | 1,000 |
| R2 | 0.12 |
| Adjusted R2 | 0.11 |
| Residual Std. Error | 414.00 (df = 992) |
| F Statistic | 19.00*** (df = 7; 992) |
| Note: | \*p<0.05; \*\*p<0.01; \*\*\*p<0.001 |
Note that the \(R^2\) is higher than in the single linear regression, indicating that we explain more variance.
Currently, the reference groups are arguably not the ideal ones. We might want males to be the reference group so we can see the effect for females; Caucasians to be the reference ethnicity so we can see the effect for African-Americans; and other executives to be the reference group so we can see the effect for CEOs. One way to do this is to recode the factor levels:
kable(table(execs$ethnicity, execs$ceo))
|  | CEO | Other Executive |
|---|---|---|
| AFRICAN-AMERICAN | 8 | 6 |
| ASIAN | 12 | 7 |
| CAUCASIAN | 430 | 183 |
| HISPANIC | 0 | 1 |
| UNKNOWN | 195 | 158 |
execs <- execs %>%
mutate(gender = factor(gender, levels = c("MALE", "FEMALE")),
ethnicity = factor(ethnicity,
levels = c("CAUCASIAN", "AFRICAN-AMERICAN", "HISPANIC", "ASIAN", "UNKNOWN")),
ceo = factor(ceo, levels = c("Other Executive", "CEO")))
knitr::kable(table(execs$ethnicity, execs$ceo))
|  | Other Executive | CEO |
|---|---|---|
| CAUCASIAN | 183 | 430 |
| AFRICAN-AMERICAN | 6 | 8 |
| HISPANIC | 1 | 0 |
| ASIAN | 7 | 12 |
| UNKNOWN | 158 | 195 |
More information about changing factor order is available in the documentation for factor() and relevel().
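If you only need to swap the reference level and the rest of the ordering does not matter, base R's relevel() is a lighter-weight alternative (a sketch, equivalent for modeling purposes):
# make "Other Executive" the reference level without retyping every level
execs$ceo <- relevel(execs$ceo, ref = "Other Executive")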
# Salary by Age, Gender, Ethnicity, and CEO
lm_1 <- lm(salary ~ age + gender + ethnicity + ceo, data=execs)
#Data Table
stargazer(lm_1, header=FALSE,
type = 'html',
font.size = "footnotesize",
digits = 2,
title = "Multiple Regression of Executive Salary",
star.cutoffs = c(0.05, 0.01, 0.001))
| Dependent variable: salary | |
|---|---|
| age | 4.00* |
| | (1.80) |
| genderFEMALE | -84.00 |
| | (56.00) |
| ethnicityAFRICAN-AMERICAN | 114.00 |
| | (112.00) |
| ethnicityHISPANIC | -161.00 |
| | (415.00) |
| ethnicityASIAN | -36.00 |
| | (97.00) |
| ethnicityUNKNOWN | -174.00*** |
| | (28.00) |
| ceoCEO | 227.00*** |
| | (28.00) |
| Constant | 443.00*** |
| | (107.00) |
| Observations | 1,000 |
| R2 | 0.12 |
| Adjusted R2 | 0.11 |
| Residual Std. Error | 414.00 (df = 992) |
| F Statistic | 19.00*** (df = 7; 992) |
| Note: | \*p<0.05; \*\*p<0.01; \*\*\*p<0.001 |
We might expect that age does not have a strictly linear relationship with salary.
ggplot(execs) +
geom_point(aes(age, salary, color=gender), alpha = 0.25) +
geom_smooth(mapping = aes(age, salary), method="loess") +
xlab("Age") +
ylab("Salary in $1,000's") +
ggtitle("Executive Salary by Age and Gender") +
theme(plot.title = element_text(hjust = 0.5))
execs <- execs %>%
mutate(age_sq = age^2,
age_cb = age^3)
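As an aside (a sketch, not the approach used in this lab), poly() builds orthogonal polynomial terms directly inside the formula. The fitted values are identical to including age + age_sq + age_cb, but the polynomial terms are uncorrelated with one another, which sidesteps the collinearity we will encounter in the VIF section below:
# orthogonal cubic polynomial in age, computed on the fly
lm_poly <- lm(salary ~ poly(age, 3) + gender + ethnicity + ceo, data = execs)
summary(lm_poly)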
library(pander)
pander(summary(execs$gender))
| MALE | FEMALE |
|---|---|
| 940 | 60 |
pander(summary(execs$ethnicity))
| CAUCASIAN | AFRICAN-AMERICAN | HISPANIC | ASIAN | UNKNOWN |
|---|---|---|---|---|
| 613 | 14 | 1 | 19 | 353 |
pander(summary(execs$ceo))
| Other Executive | CEO |
|---|---|
| 355 | 645 |
We have only one Hispanic executive in the dataset, so we should drop this observation to avoid problems later in the analysis.
execs <- filter(execs, ethnicity!="HISPANIC")
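One caveat: filter() removes the rows but leaves the now-empty "HISPANIC" level attached to the factor. lm() drops unused levels automatically, but they will still appear in tables and summaries, so it is worth cleaning them up (a small sketch):
# drop unused factor levels across the whole data frame
execs <- droplevels(execs)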
# Salary by Age, Gender, Ethnicity, and CEO
lm_1 <- lm(salary ~ age + gender + ethnicity + ceo, data=execs)
# Adding Age Squared
lm_2 <- lm(salary ~ age + age_sq + gender + ethnicity + ceo, data=execs)
# Adding Age Cubed
lm_3 <- lm(salary ~ age + age_sq + age_cb + gender + ethnicity + ceo, data=execs)
#Data Table
stargazer(lm_1, lm_2, lm_3, header=FALSE,
type = 'html',
font.size = "footnotesize",
digits = 2,
title = "Multiple Regression of Executive Salary",
star.cutoffs = c(0.05, 0.01, 0.001))
| Dependent variable: salary | (1) | (2) | (3) |
|---|---|---|---|
| age | 4.00* | 40.00* | 258.00* |
| | (1.80) | (17.00) | (107.00) |
| age_sq | | -0.31* | -4.00* |
| | | (0.14) | (1.80) |
| age_cb | | | 0.02* |
| | | | (0.01) |
| genderFEMALE | -84.00 | -92.00 | -102.00 |
| | (56.00) | (56.00) | (56.00) |
| ethnicityAFRICAN-AMERICAN | 114.00 | 108.00 | 96.00 |
| | (112.00) | (112.00) | (112.00) |
| ethnicityASIAN | -36.00 | -34.00 | -39.00 |
| | (97.00) | (97.00) | (97.00) |
| ethnicityUNKNOWN | -174.00*** | -167.00*** | -166.00*** |
| | (28.00) | (28.00) | (28.00) |
| ceoCEO | 227.00*** | 220.00*** | 219.00*** |
| | (28.00) | (28.00) | (28.00) |
| Constant | 443.00*** | -602.00 | -4,802.00* |
| | (107.00) | (483.00) | (2,092.00) |
| Observations | 999 | 999 | 999 |
| R2 | 0.12 | 0.12 | 0.13 |
| Adjusted R2 | 0.11 | 0.12 | 0.12 |
| Residual Std. Error | 414.00 (df = 992) | 413.00 (df = 991) | 413.00 (df = 990) |
| F Statistic | 23.00*** (df = 6; 992) | 20.00*** (df = 7; 991) | 18.00*** (df = 8; 990) |
| Note: | \*p<0.05; \*\*p<0.01; \*\*\*p<0.001 | | |
#Melt from Reshape2
library(reshape2)
hist_data <- execs %>%
select(-c(id, year)) %>%
mutate(gender = as.numeric(gender),
ethnicity = as.numeric(ethnicity),
ceo = as.numeric(ceo)) %>%
melt()
ggplot(hist_data, aes(x = value)) +
facet_wrap(~variable, scales = "free_x") +
geom_histogram()
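reshape2 is no longer actively developed; if you prefer to stay within the tidyverse, tidyr::pivot_longer() performs the same reshaping (a sketch, assuming dplyr >= 1.0 for across()):
# same long-format data frame as the melt() approach above
hist_data <- execs %>%
  select(-c(id, year)) %>%
  mutate(across(c(gender, ethnicity, ceo), as.numeric)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value")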
These histograms underscore several important points. First, they confirm what was indicated earlier: we should model gender, ethnicity, and ceo as categorical variables rather than numeric ones. Second, we should consider a log transformation of our dependent variable salary (as well as bonus and total_compensation).
Before we turn to this, let's examine the existing models for issues with outliers, influential observations, residual normality, homoskedasticity, multicollinearity, and non-independence of errors.
We can assess outliers and several other diagnostics using the car package.
library(car)
fit <- lm_1 # Establish which model to evaluate
outlierTest(fit) # Bonferroni p-value for most extreme obs
##     rstudent unadjusted p-value Bonferroni p
## 30      12.2            5.2e-32      5.2e-29
## 703     12.2            5.2e-32      5.2e-29
## 924      7.2            9.0e-13      9.0e-10
## 677      5.7            1.2e-08      1.2e-05
## 353      5.4            9.5e-08      9.5e-05
## 14       4.5            8.3e-06      0.0083
## 377      4.5            8.3e-06      0.0083
## 633      4.5            8.3e-06      0.0083
## 885      4.1            4.2e-05      0.0416
This test shows we have several outliers. We might consider omitting some of them in future models.
qqPlot(fit, main="QQ Plot") #qq plot for studentized resid
This plot similarly shows we have considerable outliers in the data.
leveragePlots(fit) # leverage plots
These leverage plots illustrate several outliers in modeling salary relative to the covariates. We might also examine influential observations using added-variable plots (avPlots):
library(car)
avPlots(fit)
# Cook's D plot
# identify D values > 4/(n-k-1)
cutoff <- 4/((nrow(execs)-length(fit$coefficients)-2))
plot(fit, which=4, cook.levels=cutoff)
Here, we can see several observations that are influential. We might want to consider removing these observations.
removeRows <- function(rowNum, data) {
newData <- data[-rowNum, , drop = FALSE]
rownames(newData) <- NULL
newData
}
execs_example <- removeRows(c(30, 505, 703), execs)
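A tidyverse equivalent of this helper (a sketch) is dplyr::slice() with negative row positions:
# drop rows 30, 505, and 703 by position
execs_example <- execs %>% slice(-c(30, 505, 703))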
What does the plot look like now?
lm_1 <- lm(salary ~ age + gender + ethnicity + ceo, data=execs_example)
# Cook's D plot
# identify D values > 4/(n-k-1)
cutoff <- 4/((nrow(execs_example)-length(lm_1$coefficients)-2))
plot(lm_1, which=4, cook.levels=cutoff)
Notice that we now have new outliers that appear. This is typically the case with the removal of outliers. Once we remove the most extreme values, new values will become the outliers for the new distribution.
# Evaluate Nonlinearity
# component + residual plot
crPlots(fit)
The ncvTest() function applies a test for heteroskedasticity known as the Breusch-Pagan test. There is some good discussion elsewhere of how this version compares with the test applied in Stata.
# Evaluate homoscedasticity
par(mfrow=c(2,2)) # init 4 charts in 1 panel
plot(fit)
# non-constant error variance test
ncvTest(fit)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 59 Df = 1 p = 1.6e-14
From this, we would conclude there is heteroskedasticity (which we would like to avoid).
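One common remedy, not pursued further in this lab, is to keep the OLS point estimates but report heteroskedasticity-consistent ("robust") standard errors. A minimal sketch, assuming the sandwich and lmtest packages are installed:
library(lmtest)
library(sandwich)
# same coefficients as summary(fit), but with HC1 robust standard errors
coeftest(fit, vcov. = vcovHC(fit, type = "HC1"))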
Generally, we want to avoid multicollinearity, since it can adversely affect model results; in some cases, you will see coefficients with the opposite sign in a highly collinear model. The variance inflation factor (VIF) is one test for this. A VIF > 10, or a square root of the VIF > 2, often indicates a problem. A typical exception is transformations of the same variable, such as the square or cube of a variable. Compare the VIFs for Model 1 and Model 2, with and without the square term.
# VIF for Model 1
vif(lm_1) # variance inflation factors
## GVIF Df GVIF^(1/(2*Df))
## age 1 1 1
## gender 1 1 1
## ethnicity 1 3 1
## ceo 1 1 1
sqrt(vif(lm_1)) > 2 # problem?
## GVIF Df GVIF^(1/(2*Df))
## age FALSE FALSE FALSE
## gender FALSE FALSE FALSE
## ethnicity FALSE FALSE FALSE
## ceo FALSE FALSE FALSE
# VIF for Model 2
vif(lm_2) # variance inflation factors
## GVIF Df GVIF^(1/(2*Df))
## age 86.3 1 9.3
## age_sq 86.5 1 9.3
## gender 1.0 1 1.0
## ethnicity 1.0 3 1.0
## ceo 1.1 1 1.0
sqrt(vif(lm_2)) > 2 # problem?
## GVIF Df GVIF^(1/(2*Df))
## age TRUE FALSE TRUE
## age_sq TRUE FALSE TRUE
## gender FALSE FALSE FALSE
## ethnicity FALSE FALSE FALSE
## ceo FALSE FALSE FALSE
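This age/age_sq collinearity is the expected, benign kind, but if the inflated standard errors bother you, a standard fix (a sketch) is to center age before squaring it; poly(), shown earlier, accomplishes the same thing:
# centering makes the linear and squared terms far less correlated,
# so their VIFs should drop substantially
execs_c <- execs %>%
  mutate(age_c = age - mean(age),
         age_c_sq = age_c^2)
lm_2c <- lm(salary ~ age_c + age_c_sq + gender + ethnicity + ceo, data = execs_c)
vif(lm_2c)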
Recalling the prior histograms, some of the issue may be the distribution of salary and age. Let’s transform the data using log transforms.
# natural log transform salary, age, age_sq, age_cb
execs <- execs %>%
mutate(log_salary = log(salary+1),
log_age = log(age),
log_age_sq = log_age^2,
log_age_cb = log_age^3)
# remove observations suggested by prior Cook's test
execs <- removeRows(c(30, 505, 703), execs)
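A small aside on the transform above: base R's log1p() computes log(1 + x) with better numerical accuracy near zero, so the salary line could equivalently be written like this (a sketch):
# equivalent to log(salary + 1)
execs <- execs %>% mutate(log_salary = log1p(salary))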
# Salary by Age, Gender, Ethnicity, and CEO
log_lm_1 <- lm(log_salary ~ log_age + gender + ethnicity + ceo, data=execs)
# Adding Age Squared
log_lm_2 <- lm(log_salary ~ log_age + log_age_sq + gender + ethnicity + ceo, data=execs)
# Adding Age Cubed
log_lm_3 <- lm(log_salary ~ log_age + log_age_sq + log_age_cb + gender + ethnicity + ceo, data=execs)
#Data Table
stargazer(log_lm_1, log_lm_2, log_lm_3, header=FALSE,
type = 'html',
font.size = "footnotesize",
digits = 2,
title = "Multiple Regression of Executive Salary",
star.cutoffs = c(0.05, 0.01, 0.001))
| Dependent variable: log_salary | (1) | (2) | (3) |
|---|---|---|---|
| log_age | -0.10 | 11.00 | 420.00* |
| | (0.20) | (7.60) | (187.00) |
| log_age_sq | | -1.40 | -103.00* |
| | | (0.94) | (46.00) |
| log_age_cb | | | 8.40* |
| | | | (3.80) |
| genderFEMALE | -0.03 | -0.05 | -0.06 |
| | (0.11) | (0.11) | (0.11) |
| ethnicityAFRICAN-AMERICAN | 0.31 | 0.29 | 0.27 |
| | (0.22) | (0.22) | (0.22) |
| ethnicityASIAN | 0.003 | 0.002 | -0.01 |
| | (0.19) | (0.19) | (0.19) |
| ethnicityUNKNOWN | -0.21*** | -0.20*** | -0.20*** |
| | (0.05) | (0.05) | (0.05) |
| ceoCEO | 0.40*** | 0.38*** | 0.39*** |
| | (0.05) | (0.05) | (0.05) |
| Constant | 6.70*** | -17.00 | -564.00* |
| | (0.81) | (15.00) | (252.00) |
| Observations | 996 | 996 | 996 |
| R2 | 0.08 | 0.08 | 0.08 |
| Adjusted R2 | 0.07 | 0.07 | 0.08 |
| Residual Std. Error | 0.81 (df = 989) | 0.81 (df = 988) | 0.80 (df = 987) |
| F Statistic | 14.00*** (df = 6; 989) | 12.00*** (df = 7; 988) | 11.00*** (df = 8; 987) |
| Note: | \*p<0.05; \*\*p<0.01; \*\*\*p<0.001 | | |
library(car)
#Salary
qqPlot(execs$salary, main="QQ Plot Salary")
qqPlot(execs$log_salary, main="QQ Plot Log Salary")
#Age
qqPlot(execs$age, main="QQ Plot Age")
qqPlot(execs$log_age, main="QQ Plot Log Age")
We can now see the outliers even more clearly. Let's make a scatterplot:
ggplot(execs, aes(age, log_salary)) +
geom_point(alpha=0.2)
Recall that in the original log transformation we added 1 to avoid dropping executives with zero salary. Clearly, those zero-salary observations are the outliers: they are exactly the points with log_salary equal to 0. Let's correct this by removing them, along with any incomplete cases:
execs <- execs %>%
filter(log_salary > 0,
complete.cases(.))
qqPlot(execs$log_salary, main="QQ Plot Log Salary")
ggplot(execs, aes(age, log_salary)) +
geom_point(alpha=0.2)
How have the regressions changed?
# Salary by Age, Gender, Ethnicity, and CEO
log_lm_1 <- lm(log_salary ~ log_age + gender + ethnicity + ceo, data=execs)
# Adding Age Squared
log_lm_2 <- lm(log_salary ~ log_age + log_age_sq + gender + ethnicity + ceo, data=execs)
# Adding Age Cubed
log_lm_3 <- lm(log_salary ~ log_age + log_age_sq + log_age_cb + gender + ethnicity + ceo, data=execs)
#Data Table
stargazer(log_lm_1, log_lm_2, log_lm_3, header=FALSE,
type = 'html',
font.size = "footnotesize",
digits = 2,
title = "Multiple Regression of Executive Salary",
star.cutoffs = c(0.05, 0.01, 0.001))
| Dependent variable: log_salary | (1) | (2) | (3) |
|---|---|---|---|
| log_age | 0.13 | 10.00 | 312.00* |
| | (0.16) | (6.00) | (149.00) |
| log_age_sq | | -1.30 | -76.00* |
| | | (0.75) | (37.00) |
| log_age_cb | | | 6.20* |
| | | | (3.00) |
| genderFEMALE | -0.08 | -0.09 | -0.10 |
| | (0.09) | (0.09) | (0.09) |
| ethnicityAFRICAN-AMERICAN | 0.27 | 0.26 | 0.24 |
| | (0.17) | (0.17) | (0.17) |
| ethnicityASIAN | -0.03 | -0.03 | -0.04 |
| | (0.15) | (0.15) | (0.15) |
| ethnicityUNKNOWN | -0.22*** | -0.21*** | -0.22*** |
| | (0.04) | (0.04) | (0.04) |
| ceoCEO | 0.34*** | 0.33*** | 0.33*** |
| | (0.04) | (0.04) | (0.04) |
| Constant | 5.80*** | -15.00 | -419.00* |
| | (0.64) | (12.00) | (200.00) |
| Observations | 990 | 990 | 990 |
| R2 | 0.10 | 0.10 | 0.11 |
| Adjusted R2 | 0.10 | 0.10 | 0.10 |
| Residual Std. Error | 0.64 (df = 983) | 0.64 (df = 982) | 0.64 (df = 981) |
| F Statistic | 18.00*** (df = 6; 983) | 16.00*** (df = 7; 982) | 15.00*** (df = 8; 981) |
| Note: | \*p<0.05; \*\*p<0.01; \*\*\*p<0.001 | | |
From these new results, it appears that log_lm_2 is the best model. Let's run some further diagnostics on the models:
# Evaluate homoscedasticity
par(mfrow=c(2,2)) # init 4 charts in 1 panel
plot(log_lm_1)
# non-constant error variance test
ncvTest(log_lm_1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 3 Df = 1 p = 0.083
# Evaluate homoscedasticity
par(mfrow=c(2,2)) # init 4 charts in 1 panel
plot(log_lm_2)
# non-constant error variance test
ncvTest(log_lm_2)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 2.9 Df = 1 p = 0.089
# Evaluate homoscedasticity
par(mfrow=c(2,2)) # init 4 charts in 1 panel
plot(log_lm_3)
# non-constant error variance test
ncvTest(log_lm_3)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.0013 Df = 1 p = 0.97
So no model shows significant heteroskedasticity any longer, and Model 3 appears to perform the best. Do we still have outliers in each model?
# Bonferroni p-value for most extreme obs (Log Models 1-3)
outlierTest(log_lm_1)
##     rstudent unadjusted p-value Bonferroni p
## 306      -11            4.5e-27      4.5e-24
## 489      -11            4.5e-27      4.5e-24
## 775      -11            4.5e-27      4.5e-24
## 92       -10            8.7e-23      8.6e-20
outlierTest(log_lm_2)
##     rstudent unadjusted p-value Bonferroni p
## 306      -11            4.1e-27      4.1e-24
## 489      -11            4.1e-27      4.1e-24
## 775      -11            4.1e-27      4.1e-24
## 92       -10            5.6e-23      5.5e-20
outlierTest(log_lm_3)
##     rstudent unadjusted p-value Bonferroni p
## 306      -11            6.1e-27      6.0e-24
## 489      -11            6.1e-27      6.0e-24
## 775      -11            6.1e-27      6.0e-24
## 92       -10            8.0e-23      7.9e-20
Each test reveals the same outliers. One approach is simply to remove them:
dim(execs)
## [1] 990 15
execs_example <- removeRows(c(306, 489, 775, 92), execs)
Another approach, however, is robust regression, which downweights outliers.
library(MASS) # note: MASS masks dplyr::select(), so load it after any calls to select()
# Salary by Age, Gender, Ethnicity, and CEO
log_lm_1 <- lm(log_salary ~ log_age + gender + ethnicity + ceo, data=execs)
# Adding Age Squared
log_lm_2 <- lm(log_salary ~ log_age + log_age_sq + gender + ethnicity + ceo, data=execs)
# Age Squared, Without Ethnicity
log_lm_2_1 <- lm(log_salary ~ log_age + log_age_sq + gender + ceo, data=execs)
# Adding Age Cubed
log_lm_3 <- lm(log_salary ~ log_age + log_age_sq + log_age_cb + gender + ethnicity + ceo, data=execs)
# Age Cubed, Without Ethnicity
log_lm_4 <- lm(log_salary ~ log_age + log_age_sq + log_age_cb + gender + ceo, data=execs)
# Robust Regressions: Cubic, Quadratic, and Linear in log_age, Without Ethnicity
robust_log_lm_4 <- rlm(log_salary ~ log_age + log_age_sq + log_age_cb + gender + ceo, data=execs)
robust_log_lm_5 <- rlm(log_salary ~ log_age + log_age_sq + gender + ceo, data=execs)
robust_log_lm_6 <- rlm(log_salary ~ log_age + gender + ceo, data=execs)
#Data Table
stargazer(log_lm_1, log_lm_2, log_lm_2_1, log_lm_3, log_lm_4, robust_log_lm_4, robust_log_lm_5, robust_log_lm_6, header=FALSE,
type = 'html',
font.size = "footnotesize",
digits = 2,
title = "Multiple Regression of Executive Salary",
star.cutoffs = c(0.05, 0.01, 0.001))
| Dependent variable: log_salary | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) |
|---|---|---|---|---|---|---|---|---|
| Estimator | OLS | OLS | OLS | OLS | OLS | robust linear | robust linear | robust linear |
| log_age | 0.13 | 10.00 | 14.00* | 312.00* | 317.00* | 186.00 | 15.00*** | 0.42*** |
| | (0.16) | (6.00) | (6.00) | (149.00) | (150.00) | (105.00) | (4.20) | (0.11) |
| log_age_sq | | -1.30 | -1.70* | -76.00* | -77.00* | -44.00 | -1.80*** | |
| | | (0.75) | (0.75) | (37.00) | (37.00) | (26.00) | (0.53) | |
| log_age_cb | | | | 6.20* | 6.20* | 3.50 | | |
| | | | | (3.00) | (3.10) | (2.20) | | |
| genderFEMALE | -0.08 | -0.09 | -0.07 | -0.10 | -0.08 | -0.10 | -0.10 | -0.08 |
| | (0.09) | (0.09) | (0.09) | (0.09) | (0.09) | (0.06) | (0.06) | (0.06) |
| ethnicityAFRICAN-AMERICAN | 0.27 | 0.26 | | 0.24 | | | | |
| | (0.17) | (0.17) | | (0.17) | | | | |
| ethnicityASIAN | -0.03 | -0.03 | | -0.04 | | | | |
| | (0.15) | (0.15) | | (0.15) | | | | |
| ethnicityUNKNOWN | -0.22*** | -0.21*** | | -0.22*** | | | | |
| | (0.04) | (0.04) | | (0.04) | | | | |
| ceoCEO | 0.34*** | 0.33*** | 0.36*** | 0.33*** | 0.36*** | 0.37*** | 0.37*** | 0.38*** |
| | (0.04) | (0.04) | (0.04) | (0.04) | (0.04) | (0.03) | (0.03) | (0.03) |
| Constant | 5.80*** | -15.00 | -22.00 | -419.00* | -429.00* | -254.00 | -24.00** | 4.60*** |
| | (0.64) | (12.00) | (12.00) | (200.00) | (202.00) | (141.00) | (8.50) | (0.45) |
| Observations | 990 | 990 | 990 | 990 | 990 | 990 | 990 | 990 |
| R2 | 0.10 | 0.10 | 0.08 | 0.11 | 0.08 | | | |
| Adjusted R2 | 0.10 | 0.10 | 0.07 | 0.10 | 0.08 | | | |
| Residual Std. Error | 0.64 (df = 983) | 0.64 (df = 982) | 0.65 (df = 985) | 0.64 (df = 981) | 0.65 (df = 984) | 0.41 (df = 984) | 0.42 (df = 985) | 0.43 (df = 986) |
| F Statistic | 18.00*** (df = 6; 983) | 16.00*** (df = 7; 982) | 21.00*** (df = 4; 985) | 15.00*** (df = 8; 981) | 18.00*** (df = 5; 984) | | | |
| Note: | \*p<0.05; \*\*p<0.01; \*\*\*p<0.001 | | | | | | | |
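To see the downweighting directly, we can inspect the final weights that rlm() assigned during iteratively reweighted least squares (a sketch using the cubic robust model; observations flagged by the earlier outlier tests should receive weights well below 1):
# weights near 1 are treated like OLS; weights near 0 are heavily downweighted
head(sort(robust_log_lm_4$w), 10)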